The first plot shows all the environment indicators from both the current studies and the original framework in the y-axis. Purple indicates that the indicator is only being used in the current studies, orange that it is only included in the Wiltshire framework, and green that the indicator is used in both the framework and current studies.
The x-axis shows the number of secondary data metrics that have been collected to represent those indicators. You can see that there are some indicators for which there exist many data, but many indicators for which I have found little to represent them.
NASS figures are used to cover on-farm water use, energy efficiency, and acres in conservation practices. I used the National Aquatic Resource Surveys aggregated at the state level to measure water quality. Land use diversity is pretty well represented by Multi-Resolution Land Characteristics LULC layers, which I also aggregated at the county level. Greenhouse gas emissions come from EPA figures by state, broken down by economic sector. Finally, the USFS TreeMap dataset accounts for aboveground biomass and would do reasonably well in tree vigor. There is more to pull out here than I have so far.
Otherwise, if anyone has ideas for secondary datasets to cover the rest of the indicators, please do let me know.
Code
pacman::p_load( dplyr, ggplot2, stringr, plotly, RColorBrewer)## Load data for tree and metricsenv_tree <-readRDS('data/trees/env_tree.rds')meta <-readRDS('data/sm_data.rds')[['metadata']] %>%filter(dimension =='environment')# Format to match Wiltshire frameworkmeta <- meta %>%mutate(indicator =str_to_sentence(indicator),indicator =case_when(str_detect(indicator, '^Above') ~'Aboveground biomass',str_detect(indicator, '^Water') ~'Water use / irrigation efficiency',TRUE~ indicator ) ) # Counts of secondary data metricscounts <- meta %>%group_by(indicator) %>% dplyr::summarize(count =n())# Join to Wiltshire frameworkcolors <- RColorBrewer::brewer.pal(n =3, name ='Dark2')dat <-full_join(env_tree, counts, by =join_by(Indicator == indicator)) %>%mutate(count =ifelse(is.na(count), 0, count),label_color =case_when( Use =='both'~ colors[1], Use =='wiltshire_only'~ colors[2], Use =='current_only'~ colors[3] ) )# Plotdat %>%ggplot(aes(x = Indicator, y = count)) +geom_col(color ='black',fill ='grey' ) +geom_point(data = dat,aes(x =1, y =1, color = Use),inherit.aes =FALSE ) +scale_color_manual(name ="Indicator Use:",values =c("both"= colors[1],"current_only"= colors[3],"wiltshire_only"= colors[2] ),labels =c('Both','Current Only','Framework Only' ) ) +theme_classic() +theme(axis.text.y =element_text(color = dat$label_color),legend.position ="bottom",plot.margin =margin(t =10, r =75, b =10, l =10) ) +guides(color =guide_legend(override.aes =list(size =3)) ) +coord_flip() +labs(y ='Secondary Data Count')
Bar Plot of Indicators
2 Maps
Taking a quick tour through some of the spatial data here. Most of these metrics will also be available to peruse on the Shiny app, with the exception of those that are hard to aggregate, like biodiversity hotspots.
Land Use Diversity
This is the MRLC 30m LULC layer from 2023. Below the map, you can find a table with codes and descriptions. Sort or expand to see all the values.
LULC Diversity is derived from the MRLC LULC layer above. LULC types are aggregated by category (water, developed, barren, forest, shrubland, herbaceous, cultivated, wetlands) and Shannon diversity is calculated for each county. It makes for an interesting metric, but I’m not sure it makes for a strong normative metric. If anyone has thoughts on what the “right” amount of LULC diversity is, I’d love to hear from you.
This biodiversity hotspot map was being put together around the same time as the Y2k crisis. Even if this were more recent and throughout New England, incoporating this kind of data into the framework seems a bit fraught.
The TreeMap 2016 dataset is quite comprehensive national survey of forest health and diversity. Updates are infrequent, but this is the best layer I’ve found to address biomass. The raster is at 30m.
Shown below is the mean live above-ground biomass aggregated by county so that it plays well with other metrics. Note that it is measured in tons per acre of forest, non-forest cells were removed from analysis. So, it is not showing density of forest, just biomass in existing forest. This is why the more urban counties still show a reasonable density of live biomass. There is lots more that can be pulled out of this dataset, like dead/down carbon, tree stocking, live canopy cover, height, volume, tree per acre, etc. More info can be found here.
Note that while most of the available secondary data is at the county level, the environment dimension includes a fair amount at the state level as well. This includes greenhouse gas emissions and water quality surveys. For now, I’ll just show these separately, but some creative aggregation will have to happen eventually.
Code
pacman::p_load( dplyr, purrr, ggplot2, rlang, ggpubr, tidyr)source('dev/data_pipeline_functions.R')source('dev/filter_fips.R')metrics <-readRDS('data/sm_data.rds')[['metrics']]metadata <-readRDS('data/sm_data.rds')[['metadata']]# Use metadata to get help filter by dimensionenv_meta <- metadata %>%filter(dimension =='environment')# Filter to economics dimensionenv_metrics <- metrics %>%filter(variable_name %in% env_meta$variable_name)# env_metrics$variable_name %>% unique# get_str(env_metrics)# Filter to latest year and new (post-2024) counties# And pivot wider so it is easier to get correlationsenv_county <- env_metrics %>%filter_fips(scope ='counties') %>%get_latest_year() %>%select(fips, variable_name, value) %>%mutate(variable_name =str_split_i(variable_name, '_', 1)) %>%pivot_wider(names_from ='variable_name',values_from ='value' ) %>%unnest(!fips) %>%mutate(across(c(2:last_col()), as.numeric))# get_str(env_county)## Plotplots <-map(names(env_county)[-1], \(var){if (is.character(env_county[[var]])) { env_county %>%ggplot(aes(x =!!sym(var))) +geom_bar(fill ='lightblue',color ='royalblue',alpha =0.5 ) +theme_classic() +theme(plot.margin =unit(c(rep(0.5, 4)), 'cm')) } elseif (is.numeric(env_county[[var]])) { env_county %>%ggplot(aes(x =!!sym(var))) +geom_density(fill ='lightblue',color ='royalblue',alpha =0.5 ) +theme_classic() +theme(plot.margin =unit(c(rep(0.5, 4)), 'cm')) } else {return(NULL) }}) # Arrange them in 4 columnsggarrange(plotlist = plots,ncol =4,nrow =11)
Distributions of economic metrics at the county level.
By State
Code
pacman::p_load( dplyr, purrr, ggplot2, rlang, ggpubr, tidyr)state_codes <-readRDS('data/sm_data.rds')[['fips_key']] %>%select(fips, state_code)env_state <- env_metrics %>%filter_fips(scope ='state') %>%get_latest_year() %>%select(fips, variable_name, value) %>%mutate(variable_name =str_split_i(variable_name, '_', 1)) %>%pivot_wider(names_from ='variable_name',values_from ='value' ) %>%unnest(!fips) %>%mutate(across(c(2:last_col()), as.numeric)) %>%left_join(state_codes, by ='fips')# get_str(env_state)# Variables to map. Take out some that didn't come through well.vars <-names(env_state)[-1] %>%str_subset('lakesAcidCond|lakesCylsperEpaCond|lakesMicxEpaCond|state_code|waterIrrSrcOffFarmExp|waterIrrReclaimedAcreFt|waterIrrReclaimedOpenAcres',negate =TRUE )## Plotplots <-map(vars, \(var){ env_state %>%ggplot(aes(y =!!sym(var), x = state_code, color = state_code)) +geom_point(alpha =0.5,size =3 ) +theme_classic() +theme(plot.margin =unit(c(rep(0.5, 4)), 'cm'),legend.position ='none' ) +labs(x ='State' )}) # Arrange them in 4 columnsggarrange(plotlist = plots,ncol =4,nrow =17)
Distributions of environmental variables at state level
4 Bivariate Plots
Using a selection of variables at the county level. The variable names are a bit hard to fit in here, but from left to right across the top they are LULC diversity, mean live above-ground forest biomass, conservation income per farm, conservatino easement acres per farm, conservation tillage: no-till acres per farm, conservation tillage: excluding no-till acres per farm, and cover cropping: excluding CRP acres per farm.
Code
pacman::p_load( GGally)# Neat function for mapping colors to ggpairs plots# https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-valuesmap_colors <-function(data, mapping,method ="p",use ="pairwise", ...) {# grab data x <-eval_data_col(data, mapping$x) y <-eval_data_col(data, mapping$y)# calculate correlation corr <-cor(x, y, method = method, use = use) colFn <-colorRampPalette(c("blue", "white", "red"), interpolate ='spline') fill <-colFn(100)[findInterval(corr, seq(-1, 1, length =100))]# correlation plotggally_cor(data = data, mapping = mapping, color ='black', ...) +theme_void() +theme(panel.background =element_rect(fill = fill))}lower_function <-function(data, mapping, ...) {ggplot(data = data, mapping = mapping) +geom_point(alpha =0.5) +geom_smooth(color ="blue", fill ="grey", ...) +theme_bw()}# Rename variables to be shorterenv_county %>%select(LULC = lulcDiversity,Biomass = meanAboveGrndForBiomass, consIncomePF, consEasementAcresPF, consTillNoTillAcresPF, consTillExclNoTillAcresPF, coverCropExclCrpAcresPF ) %>%ggpairs(upper =list(continuous = map_colors),lower =list(continuous = lower_function),axisLabels ='show' ) +theme(strip.text =element_text(size =5),axis.text =element_text(size =5),legend.text =element_text(size =5) )
It looks like there are a few non-linear relationships, conservation income per farm in particular, but for the most part, linear relationships do a decent job here.
5 Correlations
Only showing correlations by county because we don’t have enough observations to run it by state.
Code
pacman::p_load( dplyr, tidyr, tibble, stringr, purrr, tidyr, ggplot2, plotly, reshape, Hmisc, viridisLite)# get_str(env_county)cor <- env_county %>%select(-fips, -lulcPropNoData) %>%as.matrix() %>%rcorr()# Melt correlation values and rename columnscor_r <-melt(cor$r) %>%setNames(c('var_1', 'var_2', 'value'))# Save p valuescor_p <-melt(cor$P)p.value <- cor_p$value# Make heatmap with custom text aesthetic for tooltipplot <- cor_r %>%ggplot(aes(var_1, var_2, fill = value, text =paste0('Var 1: ', var_1, '\n','Var 2: ', var_2, '\n','Correlation: ', format(round(value, 3), nsmall =3), '\n','P-Value: ', format(round(p.value, 3), nsmall =3) ))) +geom_tile() +scale_fill_viridis_c() +theme(axis.text.x =element_text(hjust =1, angle =45)) +labs(x =NULL,y =NULL,fill ='Correlation' )# Convert to interactive plotly figure with text tooltipggplotly( plot, tooltip ='text',width =1000,height =800)
Interactive correlation plot of metrics by county
6 Analysis
This section will serve as a first pass at using some methods in the literature to aggregate metrics. I should say at the start that we have a pretty narrow selection of metrics to work with so far, which do not do a great job at capturing the breadth of the dimension. I’m also working with just the county-level data here. This provides some opportunities to use data-driven analyses like PCA, but it is worth noting that these will not get us to the holistic, system-wide measurements of sustainability we are after without including some normative judgments as to how to combine geographic areas as well as our five dimensions. So, let’s just go through the motions here, see how the process unfolds, and note anything worth digging into more down the road.
Imputation
PCA requires complete data, so we either have to impute, delete, or use PPCA. I’m choosing to impute with missing forest here as it is pretty good at handling MAR and non-linear data, but PPCA is certainly worth exploring.
Code
pacman::p_load( missForest, tibble)# Wrangle dataset. Need all numeric vars or factor vars. And can't be tibble# Also removing character vars - can't use these in PCA# Using old Connecticut counties - some lulc data is missing for them thoughdat <- env_county %>%filter_fips('old') %>%select(fips, where(is.numeric)) %>%column_to_rownames('fips') %>%as.data.frame()# get_str(dat)# skimr::skim(dat)# Remove variables with most missing data - too much to impute.# Also remove the proportional LULC values - keeping diversity thoughdat <- dat %>%select(-matches('consIncome'), -matches('^lulcProp'))# Impute missing variablesset.seed(42)mf_out <- dat %>%missForest(ntree =200,mtry =10,verbose =FALSE,variablewise =FALSE )# Save imputed datasetimp <- mf_out$ximp# Print OOBmf_out$OOBerror
NRMSE
0.00142119
Standardization
Centering and scaling to give every variable a mean of 0 and SD of 1.
Code
dat <-map_dfc(imp, ~scale(.x, center =TRUE, scale =TRUE))
Now that we have standardized variables, we have to make normative decisions about what constitutes a good or bad value. This will certainly be a collaborative process where we seek input from teams to come to some kind of consensus once we have primary data. But until then, I’m going to make some heroic assumptions that LULC diversity is good, above ground forest biomass is good, conservation practices and easements are good, and fertilizer expenses are bad. Open to thoughts here as always.
With that, we can recode our normalized variables accordingly.
Code
normed <- dat %>%mutate(across(c(matches('^fert')), ~-.x))
Component Extraction
Determine the number of components to extract using a few tools: very simple structure (VSS), Velicer’s minimum average partial (MAP) test, parallel analysis, and a scree plot.
This scree plot shows the eigenvalues (unit variance explained) of each principal component (y-axis) against each component (x-axis). The first few components explain lots of variance, but there is a decent elbow around the fourth component.
VSS suggests 1 or 2, MAP suggests 8, parallel analysis shows 3. I’m going with 3 here, which will be explained further below.
Recommendations for creating composite indices are to extract components that each have eigenvalues > 1, explained variance > 0.10, and such that the proportion of explained variance for the total set is > 0.60 (Nicoletti 2000; OECD 2008).
Our total cumulative variance is explained is 0.74, and our component that explains the least variance is RC4 with 0.11. Note that extracting four or more components here gives us a component with less than 0.10, so this is why we are sticking to three. The first component (RC1) explains 38% of the variance in the data. The second component is respectable at 0.26, while the third is barely above the threshold at 0.11.
Looking at the metrics, we can see that the first component loads mostly onto the conservation practices, no-till acres, cover cropping, drainage, and total fertilizer expenses. The second component leads onto mean above-ground biomass (although there is cross-loading with the first component), operations with silvapasture, operations with easements, rotational grazing operations, and operations with fertilizer expenses. This seems to be catching more of the population-related metrics. The last component only loads onto a few metrics: easement acres, easement acres per farm, and silvapasture operations (which has some heavy cross-loading).
Aggregation
Here, we follow Nicoletti and calculate the normalized sum of square factor loadings, which represent the proportion of total unit variance of the metrics that is explained by the component.
Code
## Get metric weights following Nicoletti 2000# Pull out metric loadingsloadings <- pca_out$weights %>%as.data.frame()# For each set of loadings, get squares, and then normalized proportionssq_loadings <-map(loadings, ~ .x^2)metric_weights <-map(sq_loadings, ~ .x /sum(.x))head(as.data.frame(metric_weights))
Now we can use these to weight metrics and aggregate them into a component score for each county.
Code
# Component scores for each component across each countycomponent_scores <-map(metric_weights, \(x) {as.matrix(normed) %*% x}) %>%as.data.frame()head(component_scores)
It looks like they are reasonably similar, although RC2 and RC3 have substantially lower correlation coefficients. It will be worth noting this and coming back to explore the differences at some point.
For now, let’s keep following Nicoletti and aggregate the component scores into a single variable.
Curious that the component that accounted for the most variance is weighted the lowest. Worth doing a dive here at some point and figuring out why that is.
We will use these to weight each component to combine them.
Now that we have all three component scores and the dimension score, let’s take a look at a map. Select the data to display with the layer button on the left.